GA versioning refactor plus fetch new rsm properties. #2974

Merged: 12 commits merged on Dec 14, 2023

Conversation

nagworld9
Contributor

@nagworld9 nagworld9 commented Nov 3, 2023

Description

The agent update handler handles two types of agent updates. The handler initializes the updater to SelfUpdateVersionUpdater and switches to the appropriate updater based on the conditions below:
RSM update: This is the update requested by RSM. The contract between CRP and the agent is that we get the following properties in the goal state:
version: the version to update to.
isVersionFromRSM: True if the version is from an RSM deployment.
isVMEnabledForRSMUpgrades: True if the VM is enabled for RSM upgrades.
If the VM is enabled for RSM upgrades, we use the RSM update path. But if the requested version does not come from an RSM deployment,
we ignore the update.
Self-update: We fall back to this if the above condition is not met. This updates to the largest version available in the manifest.
Note: Self-update doesn't support downgrades.

On every new goal state, the handler keeps track of whether the last update was done with RSM. Once the handler decides which updater to use, it
does the following steps:
1. Retrieve the agent version from the goal state.
2. Check if we are allowed to update to that version.
3. Log the update message.
4. Purge the extra agents from disk.
5. Download the new agent.
6. Proceed with update.
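
The selection rules in the description can be sketched roughly as below. This is only an illustration, not the code from this PR: `choose_update_path` and its return strings are hypothetical names, and it assumes the two flags have already been parsed from the goal state into plain booleans.

```python
def choose_update_path(is_vm_enabled_for_rsm_upgrades, is_version_from_rsm):
    """Return which path handles this goal state: 'rsm', 'ignore' or 'self-update'.

    Hypothetical helper; the name and return values are illustrative only.
    """
    if is_vm_enabled_for_rsm_upgrades:
        # The VM opted in to RSM: only act on versions requested by an RSM deployment.
        return "rsm" if is_version_from_rsm else "ignore"
    # Fallback: self-update goes to the largest version available in the manifest.
    return "self-update"


print(choose_update_path(True, True))    # RSM update path
print(choose_update_path(True, False))   # update is ignored
print(choose_update_path(False, False))  # self-update
```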

Issue #


PR information

  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines


codecov bot commented Nov 3, 2023

Codecov Report

Attention: 33 lines in your changes are missing coverage. Please review.

Comparison is base (5a41542) 71.90% compared to head (7d1de31) 71.92%.

Files                                                Patch %   Lines
azurelinuxagent/ga/rsm_version_updater.py            80.95%    6 Missing and 6 partials ⚠️
azurelinuxagent/ga/ga_version_updater.py             85.71%    9 Missing ⚠️
azurelinuxagent/ga/agent_update_handler.py           88.05%    7 Missing and 1 partial ⚠️
azurelinuxagent/ga/self_update_version_updater.py    96.42%    1 Missing and 2 partials ⚠️
azurelinuxagent/common/conf.py                       80.00%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2974      +/-   ##
===========================================
+ Coverage    71.90%   71.92%   +0.01%     
===========================================
  Files          106      109       +3     
  Lines        16105    16200      +95     
  Branches      2311     2313       +2     
===========================================
+ Hits         11581    11652      +71     
- Misses        3988     4005      +17     
- Partials       536      543       +7     




class AgentUpdateHandler(object):

"""
This class handles two type of agent updates and chooses the appropriate updater based on the below conditions:
Contributor Author

@nagworld9 nagworld9 Nov 3, 2023

Initially, we only got the requested version in the GS to update the agent. This info is not enough, since the version is always populated by CRP for non-RSM scenarios too, so we have no way to know whether the version is from an RSM deployment or not. So CRP now sends two additional flags to determine whether it's an RSM update.
isVMEnabledForRSMUpgrades: This indicates that the VM is enabled for RSM upgrades.
isVersionFromRSM: This indicates that the version is coming from RSM.

You may ask why version and isVMEnabledForRSMUpgrades are not enough, and why we need isVersionFromRSM.

Scenario: Existing vms

  1. Agent is on v1
  2. Deployed v2 to PIR but haven't started RSM deployment
  3. v1 updated to v2
  4. The VM opts in to RSM and receives a new GS with the flag set to true but the requested version still v1
  5. v2 downgraded to v1

This leads to an unintended downgrade. To avoid this, CRP sends the last flag to indicate whether the requested version (here, a downgrade) actually comes from an RSM deployment. If it does not, we don't act on it.
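
To make the scenario concrete, here is a small sketch that compares the decision with and without the third flag. It is only an illustration under stated assumptions: `should_apply_requested_version` is a hypothetical function, versions are modelled as tuples, and `version_from_rsm=None` stands for the older contract that had no such flag.

```python
def should_apply_requested_version(requested, current, vm_enabled_for_rsm,
                                   version_from_rsm=None):
    """Decide whether the RSM path acts on the requested version (hypothetical)."""
    if not vm_enabled_for_rsm:
        return False  # not on the RSM path at all; self-update handles this VM
    if version_from_rsm is None:
        # Older contract: any differing version is applied, downgrades included.
        return requested != current
    # Newer contract: only versions that come from an RSM deployment are applied.
    return version_from_rsm and requested != current


# Step 4 of the scenario: the agent is on v2, the GS says "enabled" but still carries v1.
v1, v2 = (1,), (2,)
print(should_apply_requested_version(v1, v2, True))                          # downgrade happens
print(should_apply_requested_version(v1, v2, True, version_from_rsm=False))  # request is ignored
```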

Contributor Author

updated semantics

self.requested_version = version
# Set to None if the property not specified in the GS and later computed True/False based on previous state in agent update
self.is_version_from_rsm = None
# Set to None if this property not specified in the GS and later computed True/False based on previous state in agent update
Contributor Author

The computed state is kept in agent update, since here it should reflect what we see in the goal state.

This state will be persisted throughout the current service run and might be modified by external classes.
"""
report_error_msg = ""
report_expected_version = FlexibleVersion("0.0.0.0")
Contributor Author

I have now added it to the update state class.

isVMEnabledForRSMUpgrades: True if the VM is enabled for RSM upgrades.
if vm enabled for RSM upgrades, we use RSM update path. But if requested update is not by rsm deployment
we ignore the update.
This update is allowed once per (as specified in the conf.get_autoupdate_frequency())
Contributor

Is this auto-update frequency only used for RSM updates? I see we have separate frequencies for hotfix and regular updates. If autoupdate_frequency is only used for RSM, could we update the name to differentiate it?

Contributor Author

The same frequency is used for the manifest download in self-update; let me think of a better name.

self.update_state.last_attempted_manifest_download_time = now
return True
self._is_version_from_rsm = self._get_is_version_from_rsm()
self._is_vm_enabled_for_rsm_upgrades = self._get_is_vm_enabled_for_rsm_upgrades()
Contributor

_get_is_version_from_rsm() and _get_is_vm_enabled_for_rsm_upgrades() already return False if not os.path.exists(self._get_rsm_version_state_file()):

I don't think you need the if/else condition here

if conf.get_enable_ga_versioning() and agent_family.is_requested_version_specified:
if agent_family.requested_version is not None:
return FlexibleVersion(agent_family.requested_version)
if agent_family.version is not None:
Contributor

Will version always be in agent_family? Should we check here that the version property exists?

Contributor Author

This is after we parse the goal state, so the agent_family model will always have the version; it is set to None if the version does not exist in the GS.


# we don't allow updates if version is not from RSM or downgrades below daemon version or if version is same as current version
if not self._is_version_from_rsm or self._version < self._daemon_version or self._version == CURRENT_VERSION:
return False
Contributor

If we hit this case then we don't end up doing the update, but we updated our attempt time to be now already in _is_update_allowed_this_time (line 400).

Is this the behavior we want? I think we shouldn't update the 'last_attempted_rsm_version_update_time' until we actually attempt the update

Contributor Author

@nagworld9 nagworld9 Nov 10, 2023

That is debatable. My thinking was: if we don't update the clock and the RSM request is made to many VMs at the same time, we end up updating all the VMs at the same time. Even though this is a small window (1 hour), updating the clock helps a bit to spread out the updates.

Member

I added a comment similar to @maddieford's in a previous iteration of this code. I also find that confusing.
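
For readers following this thread, the ordering the reviewers are asking about could look roughly like the sketch below: the attempt time is recorded only once the update actually proceeds. `UpdateThrottle` and `try_update` are hypothetical names, not code from this PR, and the sketch leaves out the author's counterpoint that stamping the clock early helps spread updates across VMs.

```python
import datetime


class UpdateThrottle(object):
    """Hypothetical helper that allows one update attempt per `frequency`."""

    def __init__(self, frequency):
        self._frequency = frequency
        self._last_attempt = datetime.datetime.min

    def is_allowed(self, now):
        return now >= self._last_attempt + self._frequency

    def record_attempt(self, now):
        self._last_attempt = now


def try_update(throttle, now, version_is_acceptable):
    if not throttle.is_allowed(now):
        return "throttled"
    if not version_is_acceptable:
        return "skipped"          # the clock is left untouched
    throttle.record_attempt(now)  # stamped only when we really attempt the update
    return "attempted"
```

With this ordering, a skipped version does not block a later goal state that carries an acceptable version within the same window.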

return

agent_family = self._get_agent_family_manifest(goal_state)
version = self._get_version_from_gs(agent_family)
Member

why are we trying to get the version from the goal state at this point? so far we haven't decided yet whether we are doing self-update or RSM update and the former does not need the version

None if requested version missing or GA versioning not enabled
Get the version from agent family
Returns: version if supported and available in the GS
None if version is missing
Member

Returning None makes the code harder to write/review/maintain.

For self-update we should not even be looking at the version in the goal state. If it is missing, we do not care.

For RSM update, the version must not be missing, so you can raise an update exception here to report the error and skip the update, instead of returning None.

This is a consequence of my other comment below: that we are looking at the version before we have decided whether we are doing self-update or RSM update

gs_id = goal_state.extensions_goal_state.id
self._update_rsm_version_state_if_changed(goal_state.extensions_goal_state.created_on_timestamp, agent_family)
# if version is specified and vm is enabled for rsm upgrades, use rsm update path, else sef-update
if version is None and self._is_vm_enabled_for_rsm_upgrades and self._is_version_from_rsm:
Member

this check should be done when we are picking up those properties from the goal state

"""
This function downloads the new agent and returns the downloaded version.
"""
if self._agent_manifest is None: # Fetch agent manifest if it's not already done
Member

The use of None in this class is also confusing/problematic. The init method takes all these parameters:

    def __init__(self, gs_id, agent_family, agent_manifest, version, update_state):
        self._gs_id = gs_id
        self._agent_family = agent_family
        self._agent_manifest = agent_manifest
        self._version = version
        self._update_state = update_state

Does this check for None on self._agent_manifest imply that any of these properties can be None and we need to check before using any of them? Some of them can be None, and some can't? How would the reader of this code know which ones can and which ones can't?

"""
This function downloads the new agent(requested version) and returns the downloaded version.
RSM version update:
Member

These comments should probably be in the derived classes to make the code easier to read instead of having to scroll up and down to see the comments


def __download_and_get_agent(self, goal_state, agent_family, agent_manifest, requested_version):

class GAVersionUpdater(object):
Member

GAVersionUpdater and its derived classes should probably be in a separate file. The rough rule of thumb is "one class per source file" (small/related classes can share the same file).


def __download_and_get_agent(self, goal_state, agent_family, agent_manifest, requested_version):

class GAVersionUpdater(object):
Member

@narrieta narrieta Nov 20, 2023

Thanks for creating those classes. It's a good step in the right direction. However, some of the state and logic that should be in these classes is still spread around other classes.

As an example, this state should be in these 2 derived classes:

    def __init__(self, gs_id, agent_family, agent_manifest, version, update_state):
        self._gs_id = gs_id
        self._agent_family = agent_family
        self._agent_manifest = agent_manifest
        self._version = version
        self._update_state = update_state

Same for these properties:

class AgentUpdateHandler(object):
    def __init__(self, protocol):
         ...
        # restore the state of rsm update
        if not os.path.exists(self._get_rsm_version_state_file()):
            self._is_version_from_rsm = False
            self._is_vm_enabled_for_rsm_upgrades = False
        else:
            self._is_version_from_rsm = self._get_is_version_from_rsm()
            self._is_vm_enabled_for_rsm_upgrades = self._get_is_vm_enabled_for_rsm_upgrades()

And this logic:

    def _get_is_version_from_rsm(self):
    def _get_is_vm_enabled_for_rsm_upgrades(self):
    def _get_rsm_state_used_gs_timestamp(self):
    def _update_rsm_version_state_if_changed(self, goalstate_timestamp, agent_family):

These are only 3 examples.

At this point you can't move this state to the updaters because you are creating them on each iteration of the main loop and this state needs to persist across iterations. But instantiating on each iteration is not needed. You need to instantiate them only when there is a new goal state and the "is vm enabled for RSM updates" property changes. This approach would also address my other concern about having too much code/logic executing on each iteration of the main loop.
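
One way to read this suggestion is sketched below. It is a simplified illustration, not the design that was merged: `Handler`, `SelfUpdater` and `RsmUpdater` are hypothetical stand-ins, and the `instantiations` counter exists only to show how often an updater is created.

```python
class SelfUpdater(object):
    name = "self-update"


class RsmUpdater(object):
    name = "rsm"


class Handler(object):
    """Hypothetical handler that keeps its updater across main-loop iterations."""

    def __init__(self):
        self._gs_id = None
        self._updater = SelfUpdater()  # default updater
        self.instantiations = 1

    def run(self, gs_id, vm_enabled_for_rsm):
        # Re-evaluate only when a new goal state arrives.
        if gs_id != self._gs_id:
            self._gs_id = gs_id
            wanted = RsmUpdater if vm_enabled_for_rsm else SelfUpdater
            # Replace the updater only when the "enabled for RSM" property changed.
            if not isinstance(self._updater, wanted):
                self._updater = wanted()
                self.instantiations += 1
        return self._updater.name
```

Because the updater object survives across iterations, per-updater state can live on it instead of on the handler.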

"""
raise NotImplementedError

def download_and_get_new_agent(self, protocol, goal_state):
Member

Another issue with the design of this method is that it takes the goal state as parameter and the init method takes several properties from the goal state. With this design, this goal state and the properties in init may correspond to different goal states. You may want to consider passing the goal state on init and remove it from here

Returns: None if fail to report or update never attempted with rsm version specified in GS
"""
try:
if conf.get_enable_ga_versioning() and self._is_vm_enabled_for_rsm_upgrades and self._is_version_from_rsm:
Member

why do we need to check for self._is_version_from_rsm here? Does the protocol with CRP require that we do not report status when _is_vm_enabled_for_rsm_upgrades is true but _is_version_from_rsm is not?

"""
try:
with open(self._get_rsm_version_state_file(), "w") as file_:
json.dump({"isVersionFromRSM": isVersionFromRSM,
Member

why do we need to persist isVersionFromRSM?

def get_autoupdate_frequency(conf=__conf__):
return conf.get_int("Autoupdate.Frequency", 3600)
def get_agentupdate_frequency(conf=__conf__):
return conf.get_int("AgentUpdate.Frequency", 3600)
Member

We should not rename these parameters, since customers may be using them

Contributor Author

I agree customers may have taken a dependency on it. My point was that this frequency is shared between self-update and RSM update, so I thought "auto update" might be confusing. Will revert it.

uris_list = find(ga_family, "Uris")
uris = findall(uris_list, "Uri")
family = VMAgentFamily(name, version)
family = VMAgentFamily(name)
if version is not None:
Member

@narrieta narrieta Nov 28, 2023

why do we need this check for None, if VMAgentFamily initializes the property to None anyways?

this gives the impression that the property should not be None

return json.load(file_)["isLastUpdateWithRSM"]
except Exception as e:
logger.warn(
"Can't retrieve the isLastUpdateWithRSM from rsm state file ({0}), will assume it False. Error: {1}",
Member

@narrieta narrieta Nov 28, 2023

assuming False may be incorrect.

the file just contains a boolean; you can eliminate this error condition by using the presence of the file as the boolean: if it exists, it's RSM, otherwise it's self-update

Contributor Author

Well, even True is not right all the time. I feel that assuming False and doing self-update is safer than assuming True. With RSM, we need the extra flags to move forward with the update; if they are not there, we fail the update. That would stop updates, whereas self-update at least keeps customers' VMs on the latest version. Later, we switch back to RSM if the flags show up.

As for using the presence of the file as the boolean: I already check for its presence in Line #100, so at this point we can assume True if that is what we want.

Member

Exactly, assuming True is not safe either.

Hence my suggestion of just using an empty file: if the file exists, the value (is RSM) is True, otherwise it is False

Member

This is what I mean


def _get_is_last_update_with_rsm(self):
    return os.path.exists(self._get_rsm_update_state_file())

and change _save_rsm_update_state to create the file when its argument is True and delete it when it is False.
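
A self-contained sketch of both halves of this suggestion follows. It is an illustration only: `RsmUpdateState` is a hypothetical wrapper, and the marker file path is passed in by the caller rather than derived from the agent's lib directory.

```python
import os


class RsmUpdateState(object):
    """Hypothetical holder for the marker file; the file's presence is the boolean."""

    def __init__(self, state_file):
        self._state_file = state_file

    def get_is_last_update_with_rsm(self):
        return os.path.exists(self._state_file)

    def save_rsm_update_state(self, is_rsm):
        if is_rsm:
            # Create (or truncate) an empty marker file.
            with open(self._state_file, "w"):
                pass
        elif os.path.exists(self._state_file):
            os.remove(self._state_file)
```

Since nothing is ever read from the file, there is no parse error to handle and no default value to guess.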

if vm is enabled for RSM updates and continue with rsm update, otherwise we raise exception to switch to self-update.
if either isVersionFromRSM or isVMEnabledForRSMUpgrades is missing in the goal state, we ignore the update as we consider it as invalid goal state.
"""
if self._gs_id != gs_id:
Member

This check should be done with the value returned by UpdateHandler._processing_new_extensions_goal_state()

@@ -151,211 +127,55 @@ def __get_agent_family_manifests(self, goal_state):
agent_family_manifests.append(m)

if not family_found:
raise AgentUpdateError(u"Agent family: {0} not found in the goal state, skipping agent update".format(family))
raise AgentUpdateError(u"Agent family: {0} not found in the goal state incarnation: {1}, skipping agent update".format(family, self._gs_id))
Member

"incarnation" applies only to Fabric goal states

return

if requested_version == CURRENT_VERSION:
# verify if agent update is allowed this time (RSM checks 1 hr interval; self-update checks manifest download interval)
Member

RSM updates need to be checked on every goal state, the 1-hour timer is just for retries

Contributor Author

My concern was that we would do the update on extension goal states too if we go with every goal state.

Member

Not sure I understand your concern. We can go over it offline if you want.

My understanding is that the contract with CRP is that the Agent must process the goal state requesting the update, not 1 hour later (or less, depending on timing)

Contributor Author

sure, we can discuss in sync.

My point is: how do you know a GS is related to an agent update request and not an extension operation request? If we allow it on every goal state, then, for example, if we get 10 new GS within the 1-hour timer, we end up fetching the agent family 10 times and validating whether the version is already updated, which is unnecessary. It seems there are some tradeoffs we need to make. If we want this, we have to pay the price of doing extra checks.

"""
Get the agent version from the goal state
"""
if agent_family.version is None and agent_family.is_vm_enabled_for_rsm_upgrades and agent_family.is_version_from_rsm:
Member

We talked about encapsulating the use of those tri-state variables in 1 single method, otherwise they are difficult to use by callers (they always need to check for None). The method above, check_and_switch_updater_if_changed, is protecting itself against None, but this method is not.

Contributor Author

Yes, I felt the check_and_switch_updater_if_changed method was not a good place for the version None check. That's why I added it in the retrieve version method.

If it is still adding confusion, I can place it in the same method.

Member

My suggestion is to encapsulate this in the GoalState class. Make the tri-state variables private and provide access to them only thru 1 method, let's say get_agent_update_info, which returns a struct with the two booleans and the string, and throws exceptions if the values in the goal state are invalid or missing.

Of course, there are other ways to do this. The idea is to limit the potential use of invalid values to 1 single method instead of having every single method that uses those properties checking for None.
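
A rough sketch of what such a single access point might look like is below. The method name `get_agent_update_info` comes from the comment above; everything else (`AgentUpdateInfo`, `InvalidGoalStateError`, and the stand-in `AgentFamily` class used in place of the real goal state class) is hypothetical.

```python
import collections

AgentUpdateInfo = collections.namedtuple(
    "AgentUpdateInfo",
    ["version", "is_version_from_rsm", "is_vm_enabled_for_rsm_upgrades"])


class InvalidGoalStateError(Exception):
    """Hypothetical exception for missing or invalid update properties."""


class AgentFamily(object):
    """Stand-in for the goal state model; the tri-state fields are private."""

    def __init__(self, version=None, is_version_from_rsm=None,
                 is_vm_enabled_for_rsm_upgrades=None):
        self._version = version
        self._is_version_from_rsm = is_version_from_rsm
        self._is_vm_enabled_for_rsm_upgrades = is_vm_enabled_for_rsm_upgrades

    def get_agent_update_info(self):
        # The only place where None (missing in the goal state) is handled.
        fields = (("version", self._version),
                  ("isVersionFromRSM", self._is_version_from_rsm),
                  ("isVMEnabledForRSMUpgrades", self._is_vm_enabled_for_rsm_upgrades))
        missing = [name for name, value in fields if value is None]
        if missing:
            raise InvalidGoalStateError(
                "Missing in goal state: {0}".format(", ".join(missing)))
        return AgentUpdateInfo(self._version, self._is_version_from_rsm,
                               self._is_vm_enabled_for_rsm_upgrades)
```

Callers then work with two real booleans and a string, and never need their own None checks.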

"""
if self._gs_id != gs_id:
self._gs_id = gs_id
if conf.get_enable_ga_versioning() and agent_family.is_vm_enabled_for_rsm_upgrades:
Member

another method that is not protecting itself against None

Contributor Author

Which None are you talking about? Is it is_vm_enabled_for_rsm_upgrades? That is intentional. In self-update, I don't want to fail with an error (and stop the updates) if it's None; rather, I want to continue with self-update until the flags show up. Once the enabled flag is there, we should have the other flag too; otherwise we don't switch to RSM update.

Member

@narrieta narrieta Nov 29, 2023

Are you checking for None on line 127? It looks like a check on a boolean. If you are checking against None, please do an explicit comparison against None.

Taking advantage that None evaluates as False in a bool context is OK for small, one-single-use scripts, not so good for production code in a large code base.
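
The difference between the two styles can be seen in a tiny example. The function names are made up for illustration; only the tri-state idea (True, False, or None when the property is absent from the goal state) comes from this thread.

```python
def describe_flag_truthiness(flag):
    # Relies on None being falsy: "missing" and "disabled" become indistinguishable.
    return "enabled" if flag else "disabled"


def describe_flag_explicit(flag):
    # Explicit comparison: the missing case is visible to the reader and handled on purpose.
    if flag is None:
        return "missing"
    return "enabled" if flag else "disabled"


print(describe_flag_truthiness(None))  # disabled
print(describe_flag_explicit(None))    # missing
```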

Member

Having said that, see my comment above wrt allowing those properties to be None. Please don't. Encapsulate the use of invalid values (None) in a single method.

This method ensure that update is allowed only once per (hotfix/Regular) upgrade frequency
"""
now = datetime.datetime.now()
next_hotfix_time, next_regular_time = self._get_next_upgrade_times(now)
Member

why do we need to compute both the hotfix and regular times? at any given time, only one of them is valid

Contributor Author

Good point. Before we compute, I can determine whether it is a hotfix or regular version and, depending on that, compute only one time. Makes sense.
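
A sketch of that idea is shown below. It rests on assumptions that are not confirmed anywhere in this conversation: the two frequencies are placeholder values, the rule used to classify an update as a hotfix (same major.minor.patch, different build) is a guess, and versions are modelled as tuples rather than FlexibleVersion.

```python
import datetime

# Placeholder values for illustration; the real ones come from the agent's configuration.
HOTFIX_FREQUENCY = datetime.timedelta(hours=4)
REGULAR_FREQUENCY = datetime.timedelta(hours=24)


def is_hotfix(current, candidate):
    """Assumed rule: only the build (fourth) component differs."""
    return current[:3] == candidate[:3]


def next_upgrade_time(last_attempt, current, candidate):
    # Classify first, then compute only the one time that applies.
    frequency = HOTFIX_FREQUENCY if is_hotfix(current, candidate) else REGULAR_FREQUENCY
    return last_attempt + frequency
```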

Member

@narrieta narrieta left a comment

thanks for the updates, code improved a lot

added some extra comments

try:
# updater will raise exception if we need to switch to self-update or rsm update
self._updater.check_and_switch_updater_if_changed(agent_family, gs_id)
except VMDisabledRSMUpdates:
Member

Opting in and out of RSM are not error conditions, but the code communicates those transitions with exceptions, which is odd. Exceptions are used to communicate errors. Consider changing that to conditionals.


if next_attempt_time > now:
return False
self._last_attempted_rsm_version_update_time = now
Contributor

Why do we update the last_attempted_rsm_version_update_time before we actually proceed with the update? If for some reason we can't update to the provided version, then we would have to wait one whole frequency before attempting another update with a different version.

Not necessarily suggesting we change this, but just wanting to understand the decision

Member

This has been pointed out at least twice in previous iterations of this PR. I agree with @maddieford's comment

Contributor Author

Removed the timer, now rsm updates happen on every GS

raise VMDisabledRSMUpdates()

if agent_family.is_vm_enabled_for_rsm_upgrades is None:
raise AgentUpdateError(
Contributor

@maddieford maddieford Nov 29, 2023

if we have a VM which was enabled for RSM upgrades and then is switched to a host which does not send down the is_vm_enabled_for_rsm_upgrades, then we will not process any goal states?

What is the solution for customers who may end up getting blocked by this? I'm thinking we can use the GA versioning flag in conf to disable. Is there another way?

Contributor Author

As we discussed offline, there isn't much we can do about it.

@nagworld9 nagworld9 merged commit 56543ed into Azure:develop Dec 14, 2023
11 checks passed
nagworld9 added a commit that referenced this pull request Mar 27, 2024
* Add support for Azure Clouds (#2795)

* Add support for Azure Clouds
---------

Co-authored-by: narrieta <narrieta>

* Check certificates only if certificates are included in goal state and update test-requirements to remove codecov (#2803)

* Update version to dummy 1.0.0.0'

* Revert version change

* Only check certificats if goal state includes certs

* Fix code coverage deprecated issue

* Move condition to function call

* Add tests for no outbound connectivity (#2804)

* Add tests for no outbound connectivity

---------

Co-authored-by: narrieta <narrieta>

* Use cloud when validating test location (#2806)

* Use cloud when validating test location
---------

Co-authored-by: narrieta <narrieta>

* Redact access tokens from extension's output (#2811)

* Redact access tokens from extension's output

* python 2.6

---------

Co-authored-by: narrieta <narrieta>

* Add @GabstaMSFT as code owner (#2813)

Co-authored-by: narrieta <narrieta>

* Fix name of single IB device when provisioning RDMA (#2814)

The current code assumes the ipoib interface name is ib0 when single IB
interface is provisioned. This is not always true when udev rules are used
to rename to other names like ibPxxxxx.

Fix this by searching any interface name starting with "ib".

* Allow tests to run on random images (#2817)

* Allow tests to run on random images

* PR feedback

---------

Co-authored-by: narrieta <narrieta>

* Bug fixes for end-to-end tests (#2820)

Co-authored-by: narrieta <narrieta>

* Enable all Azure clouds on end-to-end tests (#2821)

Co-authored-by: narrieta <narrieta>

* Add Azure CLI to container image (#2822)

Co-authored-by: narrieta <narrieta>

* Fixes for Azure clouds (#2823)

* Fixes for Azure clouds

* add debug info

---------

Co-authored-by: narrieta <narrieta>

* Add test for extensions disabled; refactor VirtualMachine and VmExtension (#2824)

* Add test for extensions disabled; refactor VirtualMachine and VmExtension
---------

Co-authored-by: narrieta <narrieta>

* Fixes for end-to-end tests (#2827)

Co-authored-by: narrieta <narrieta>

* Add test for osProfile.linuxConfiguration.provisionVMAgent (#2826)

* Add test for osProfile.linuxConfiguration.provisionVMAgent

* add files

* pylint

* added messages

* ssh issue

---------

Co-authored-by: narrieta <narrieta>

* Enable suppression rules for waagent.log (#2829)

Co-authored-by: narrieta <narrieta>

* Wait for service start when setting up test VMs; collect VM logs when setup fails (#2830)

Co-authored-by: narrieta <narrieta>

* Add vm arch to heartbeat telemetry (#2818) (#2838)

* Add VM Arch to heartbeat telemetry

* Remove outdated vmsize heartbeat tesT

* Remove unused import

* Use platform to get vmarch

(cherry picked from commit 66e8b3d)

* Add regular expression to match logs from very old agents (#2839)

Co-authored-by: narrieta <narrieta>

* Increase concurrency level for end-to-end tests (#2841)

Co-authored-by: narrieta <narrieta>

* Agent update refactor supports GA versioning (#2810)

* agent update refactor (#2706)

* agent update refactor

* address PR comments

* updated available agents

* fix pylint warn

* updated test case warning

* added kill switch flag

* fix pylint warning

* move last update attempt variables

* report GA versioning supported feature. (#2752)

* control agent updates in e2e tests and fix uts (#2743)

* disable agent updates in dcr and fix uts

* address comments

* fix uts

* report GA versioning feature

* Don't report SF flag if auto update is disabled (#2754)

* fix uts (#2759)

* agent versioning test_suite (#2770)

* agent versioning test_suite

* address PR comments

* fix pylint warning

* fix update assertion

* fix pylint error

* logging manifest type and don't log same error until next period in agent update. (#2778)

* improve logging and don't log same error until next period

* address comments

* update comment

* update comment

* Added self-update time window. (#2794)

* Added self-update time window

* address comment

* Wait and retry for rsm goal state (#2801)

* wait for rsm goal state

* address comments

* Not sharing agent update tests vms and added scenario to daily run (#2809)

* add own vm property

* add agent_update to daily run

* merge conflicts

* address comments

* address comments

* additional comments addressed

* fix pylint warning

* Add test for FIPS (#2842)

* Add test for FIPS

* add test

* increase sleep

* remove unused file

* added comment

* check uptime

---------

Co-authored-by: narrieta <narrieta>

* Eliminate duplicate list of test suites to run (#2844)

* Eliminate duplicate list of test suites to run

* fix paths

* add agent update

---------

Co-authored-by: narrieta <narrieta>

* Port NSBSD system to the latest version of waagent (#2828)

* nsbsd: adapt to recent dns.resolver

* osutil: Provide a get_root_username function for systems where it's not 'root' (like in nsbsd)

* nsbsd: tune the configuration filepath

* nsbsd: fix lib installation path

---------

Co-authored-by: Norberto Arrieta <narrieta@users.noreply.github.com>

* Fix method name in update test (#2845)

Co-authored-by: narrieta <narrieta>

* Expose run name as a runbook variable (#2846)

Co-authored-by: narrieta <narrieta>

* Collect test artifacts as a separate step in the test pipeline (#2848)

* Collect test artifacts as a separate step in the test pipeline
---------

Co-authored-by: narrieta <narrieta>

* remove agent update test and py27 version from build (#2853)

* Fix infinite retry loop in end to end tests (#2855)

* Fix infinite retry loop

* fix message

---------

Co-authored-by: narrieta <narrieta>

* Remove empty "distro" module (#2854)

Co-authored-by: narrieta <narrieta>

* Enable Python 2.7 for unit tests (#2856)

* Enable Python 2.7 for unit tests

---------

Co-authored-by: narrieta <narrieta>

* Skip downgrade if requested version below daemon version (#2850)

* skip downgrade for agent update

* add test

* report it in status

* address comments

* revert change

* improved error msg

* address comment

* update location schema and added skip clouds in suite yml (#2852)

* update location schema in suite yml

* address comments

* .

* pylint warn

* comment

* Do not collect LISA logs by default (#2857)

Co-authored-by: narrieta <narrieta>

* Add check for noexec on Permission denied errors (#2859)

* Add check for noexec on Permission denied errors

* remove type annotation

---------

Co-authored-by: narrieta <narrieta>

* Wait for log message in AgentNotProvisioned test (#2861)

* Wait for log message in AgentNotProvisioned test

* hardcoded value

---------

Co-authored-by: narrieta <narrieta>

* Always collect logs on end-to-end tests (#2863)

* Always collect logs

* cleanup

---------

Co-authored-by: narrieta <narrieta>

* agent publish scenario (#2847)

* agent publish

* remove vm size

* address comments

* daemon version fallback

* daemon version fix

* address comments

* fix pylint error

* address comment

* added error handling

* add time window for agent manifest download (#2860)

* add time window for agent manifest download

* address comments

* address comments

* ignore 75-persistent-net-generator.rules in e2e tests (#2862)

* ignore 75-persistent-net-generator.rules in e2e tests

* address comment

* remove

* Always publish artifacts and test results (#2865)

Co-authored-by: narrieta <narrieta>

* Add tests for extension workflow (#2843)

* Update version to dummy 1.0.0.0

* Revert version change

* Basic structure

* Test must run in SCUS for test ext

* Add GuestAgentDCRTest Extension id

* Test structure

* Update test file name

* test no location

* Test location as southcentralus

* Assert ext is installed

* Try changing version for dcr test ext

* Update expected message in instance view

* try changing message to string

* Limit images for ext workflow

* Update classes after refactor

* Update class name

* Refactor tests

* Rename extension_install to extension_workflow

* Assert ext status

* Assert operation sequence is expected

* Remove logger reference

* Pass ssh client

* Update ssh

* Add permission to run script

* Correct permissions

* Add execute permissions for helper script

* Make scripts executable

* Change args to string

* Add required parameter

* Add shebang for retart_agent

* Fix arg format

* Use restart utility

* Run restart with sudo

* Add enable scenario

* Attempt to remove start_time

* Only assert enable

* Add delete scenario

* Fix uninstall scenario

* Add extension update scenario

* Run assert scenario on update scenario

* Fix reference to ext

* Format args as str instead of arr

* Update test args

* Add test case for update without install

* Fix delete

* Keep changes

* Save changes

* Add special chars test case

* Fix dcr_ext issue

* Add validate no lag scenario

* Fix testguid reference

* Add additional log statements for debugging

* Fix message to check before encoding

* Encode setting name

* Correctly check data

* Make check data executable

* Fix command args for special char test

* Fix no lag time

* Fix ssh client reference

* Try message instead of text

* Remove unused method

* Start clean up

* Continue code cleanup

* Fix pylint errors

* Fix pylint errors

* Start refactor

* Debug agent lag

* Update lag logging

* Fix assert_that for lag

* Remove typo

* Add readme for extension_workflow scenario

* Reformat comment

* Improve logging

* Refactor assert scenario

* Remove unused constants

* Remove unused parameter in assert scenario

* Add logging

* Improve logging

* Improve logging

* Fix soft assertions issue

* Remove todo for delete polling

* Remove unnecessary new line

* removed unnecessary function

* Make special chars log more readable

* remove unnecessary log

* Add version to add or update log

* Remove unnecessary assert instance view

* Add empty log line

* Add update back to restart args to debug

* Add update back to restart args to debug

* Remove unused init

* Remove test_suites from pipeline yml

* Update location in test suite yml

* Add comment for location restriction

* Remove unused init and fix comments

* Improve method header

* Rename scripts

* Remove print_function

* Rename is_data_in_waagent_log

* Add comments describing assert operation sequence script

* add comments to scripts and type annotate assert operation sequence

* Add GuestAgentDcrExtension source code to repo

* Fix typing.dict error

* Fix typing issue

* Remove outdated comment

* Add comments to extension_workflow.py

* rename scripts to match test suite name

* Ignore pylint warnings on test ext

* Update pylint rc to ignore tests_e2e/GuestAgentDcrTestExtension

* Update pylint rc to ignore tests_e2e/GuestAgentDcrTestExtension

* disable all errors/warnings dcr test ext

* disable all errors/warnings dcr test ext

* Run workflow on debian

* Revert to dcr config distros

* Move enable increment to beginning of function

* Fix gs completed regex

* Remove unnecessary files from dcr test ext dir

* Update agent_ext_workflow.yml to skip China and Gov clouds (#2872)

* Update agent_ext_workflow.yml to skip China and Gov clouds

* Update tests_e2e/test_suites/agent_ext_workflow.yml

* fix daemon version (#2874)

* Wait for extension goal state processing before checking for lag in log (#2873)

* Update version to dummy 1.0.0.0

* Revert version change

* Add sleep time to allow goal state processing to complete before lag check

* Add retry logic to gs processing lag check

* Clean up retry logic

* Add back empty line

* Fix timestamp parsing issue

* Fix timestamp parsing issue

* Fix timestamp parsing issue

* Do 3 retries

* Extract tarball with xvf during setup (#2880)

In a pipeline run we saw the following error when extracting the tarball on the test node:

Adding v to extract the contents with verbose output

* enable agent update in daily run (#2878)

* Create Network Security Group for test VMs (#2882)

* Create Network Security Group for test VMs

* error handling

---------

Co-authored-by: narrieta <narrieta>

* don't allow downgrades for self-update (#2881)

* don't allow downgrades for self-update

* address comments

* update comment

* add logger

* Suppress telemetry failures from check agent log (#2887)

Co-authored-by: narrieta <narrieta>

* Install assertpy on test VMs (#2886)

* Install assertpy on test VMs

* set versions

---------

Co-authored-by: narrieta <narrieta>

* Add sample remote tests (#2888)

* Add sample remote tests

* add pass

* review feedback

---------

Co-authored-by: narrieta <narrieta>

* Enable Extensions.Enabled in tests (#2892)

* enable Extensions.Enabled

* address comment

* address comment

* use script

* improve msg

* improve msg

* Reorganize file structure of unit tests (#2894)

* Reorganize file structure of unit tests

* remove duplicate

* add init

* mocks

---------

Co-authored-by: narrieta <narrieta>

* Report useful message when extension processing is disabled (#2895)

* Update version to dummy 1.0.0.0

* Revert version change

* Fail GS fast in case of extensions disabled

* Update extensions_disabled scenario to look for GS failed instead of timeout when extensions are disabled

* Update to separate onHold and extensions enabled

* Report ext disabled error in handler status

* Try using GoalStateUnknownFailure

* Fix indentation error

* Try failing ext handler and checking logs

* Report ext processing error

* Attempt to fail fast

* Fix param name

* Init error

* Try to reuse current code

* Try to reuse current code

* Clean code

* Update scenario tests

* Add ext status file to fail fast

* Fail fast test

* Report error when ext disabled

* Update timeout to 20 mins

* Re enable ext for debugging

* Re enable ext for debugging

* Log agent status update

* Create ext status file with error code

* Create ext status file with error code

* We should report handler status even if not installed in case of extensions disabled

* Clean up code change

* Update tests for extensions disabled

* Update test comment

* Update test

* Remove unused line

* Remove unused timeout

* Test failing case

* Remove old case

* Remove unused import

* Test multiconfig ext

* Add multi-config test case

* Clean up test

* Improve logging

* Fix dir for testfile

* Remove ignore error rules

* Remove unused imports

* Set handler status to not ready explicitly

* Use OS Util to get agent conf path

* Retry tar operations after 'Unexpected EOF in archive' during node setup (#2891)

* Update version to dummy 1.0.0.0

* Revert version change

* Capture output of the copy commands during setup

* Add verbose to copy command

* Update typing for copy to node methods

* Print contents of tar before extracting

* Print contents of tar before extracting

* Print contents of tar before extracting

* Print contents of tar before extracting

* Retry copying tarball if contents on test node do not match

* Revert copy method def

* Revert copy method def

* Catch EOF error

* Retry tar operations if we see failure

* Revert target_path

* Remove accidental copy of exception

* Remove blank line

* tar cvf and copy commands overwrite

* Add log and telemetry event for extension disabled (#2897)

* Update version to dummy 1.0.0.0

* Revert version change

* Add logs and telemetry for processing extensions when extensions disabled

* Reformat string

* Agent status scenario (#2875)

* Update version to dummy 1.0.0.0

* Revert version change

* Create files for agent status scenario

* Add agent status test logic

* fix pylint error

* Add comment for retry

* Mark failures as exceptions

* Improve messages in logs

* Improve comments

* Update comments

* Check that agent status updates without processing additional goal states 3 times

* Remove unused agent status exception

* Update comment

* Clean up comments, logs, and imports

* Exception should inherit from baseexception

* Import datetime

* Import datetime

* Import timedelta

* instance view time is already formatted

* Increase status update time

* Increase status update time

* Increase status update time

* Increase timeout

* Update comments and timeouts

* Allow retry if agent status timestamp isn't updated after 30s

* Remove unused import

* Update time value in comment

* address PR comments

* Check if properties are None

* Make types & errors more readable

* Re-use vm_agent variable

* Add comment for dot operator

* multi config scenario (#2898)

* Update version to dummy 1.0.0.0

* Revert version change

* multi config scenario bare bones

* multi config scenario bare bones

* Stash

* Add multi config test

* Run on arm64

* RCv2 is not supported on arm64

* Test should own VM

* Add single config ext to test

* Add single config ext to test

* Do not fail test if there are unexpected extensions on the vm

* Update comment for accuracy

* Make resource name parameter optional

* Clean up code

* agent and ext cgroups scenario (#2866)

* agent-cgroups scenario

* address comments

* address comments

* fix-pylint

* pylint warn

* address comments

* improved logging

* improved ext cgroups scenario

* new changes

* pylint fix

* updated

* address comments

* pylint warn

* address comment

* merge conflicts

* agent firewall scenario (#2879)

* agent firewall scenario

* address comments

* improved logging

* pylint warn

* address comments

* updated

* address comments

* pylint warning

* pylint warning

* address comment

* merge conflicts

* Add retry and improve the log messages in agent update test (#2890)

* add retry

* improve log messages

* merge conflicts

* Cleanup common directory (#2902)

Co-authored-by: narrieta <narrieta>

* improved logging (#2893)

* skip test in mooncake and usgov (#2904)

* extension telemetry pipeline scenario (#2901)

* Update version to dummy 1.0.0.0

* Revert version change

* Barebones for etp

* Scenario should own VM because of conf change

* Add extension telemetry pipeline test

* Clean up code

* Improve log messages

* Fix pylint errors

* Improve logging

* Improve code comments

* VmAccess is not supported on flatcar

* Address PR comments

* Add support_distros in VmExtensionIdentifier

* Fix logic for support_distros in VmExtensionIdentifier

* Use run_remote_test for remote script

* Ignore logcollector fetch failure if it recovers (#2906)

* download_fail unit test should use agent version in common instead of 9.9.9.9 (#2908) (#2912)

(cherry picked from commit ed80388)

* Download certs on FT GS after check_certificates only when missing from disk (#2907) (#2913)

* Download certs on FT GS only when missing from disk

* Improve telemetry for inconsistent GS

* Fix string format

(cherry picked from commit c13f750)

* Update pipeline.yml to increase timeout to 90 minutes (#2910)

Runs have been timing out after 60 minutes due to multiple scenarios sharing VMs

* Fix agent memory usage check (#2903)

* fix memory usage check

* add test

* added comment

* fix test

* disable ga versioning changes (#2917)

* Disable ga versioning changes (#2909)

* disable rsm changes

* add flag

(cherry picked from commit 5a4fae8)

* merge conflicts

* fix the ignore rule in agent update test (#2915) (#2918)

* ignore the agent installed version

* address comments

* address comments

* fixes

(cherry picked from commit 8985a42)

* Use Mariner 2 in FIPS test (#2916)

* Use Mariner 2 in FIPS test
---------

Co-authored-by: narrieta <narrieta>

* Change pipeline timeout to 90 minutes (#2925)

* fix version checking (#2920)

Co-authored-by: Norberto Arrieta <narrieta@users.noreply.github.com>

* mariner container image (#2926)

* mariner container image

* added packages repo

* addressed comments

* addressed comments

* Fix for "local variable _COLLECT_NOEXEC_ERRORS referenced before assignment" (#2935)

* Fix for "local variable _COLLECT_NOEXEC_ERRORS referenced before assignment"

* pylint

---------

Co-authored-by: narrieta <narrieta>

* fix agent manifest call frequency (#2923) (#2932)

* fix agent manifest call frequency

* new approach

(cherry picked from commit 6554032)

* enable rhel/centos cgroups (#2922)

* Add support for EC certificates (#2936)

* Add support for EC certificates

* pylint

* pylint

* typo

---------

Co-authored-by: narrieta <narrieta>

* Add Cpu Arch in local logs and telemetry events (#2938)

* Add cpu arch to telem and local logs

* Change get_vm_arch to static method

* update unit tests

* Remove e2e pipeline file

* Remove arch from heartbeat

* Move get_vm_arch to osutil

* fix syntax issue

* Fix unit test

* skip cgroup monitor (#2939)

* Clarify support status of installing from source. (#2941)

Co-authored-by: narrieta <narrieta>

* agent cpu quota scenario (#2937)

* agent_cpu_quota scenario

* addressed comments

* addressed comments

* skip test version install (#2950)

* skip test install

* address comments

* pylint

* local run stuff

* undo

* Add support for VM Scale Sets to end-to-end tests (#2954)

---------

Co-authored-by: narrieta <narrieta>

* Ignore dependencies when the extension does not have any settings (#2957) (#2962)

* Ignore dependencies when the extension does not have any settings

* Remove message

---------

Co-authored-by: narrieta <narrieta>
(cherry picked from commit 79bc12c)

* Cache daemon version (#2942) (#2963)

* cache daemon version

* address comments

* test update

(cherry picked from commit 279d557)

* update warning message (#2946) (#2964)

(cherry picked from commit 33552ee)

* fix self-update frequency to spread over 24 hrs for regular type and 4 hrs for hotfix (#2948) (#2965)

* update self-update frequency

* address comment

* mark with comment

* addressed comment

(cherry picked from commit f15e6ef)

* Reduce the firewall check period in agent firewall tests (#2966)

* reduce firewall check period

* reduce firewall check period

* undo get daemon version change (#2951) (#2967)

* undo daemon change

* pylint

(cherry picked from commit fabe7e5)

* disable agent update (#2953) (#2968)

(cherry picked from commit 9b15b04)

* Change agent_cgroups to own Vm (#2972)

* Change cgroups to own Vm

* Agent cgroups should own vm

* Check SSH connectivity during end-to-end tests (#2970)

Co-authored-by: narrieta <narrieta>

* Gathering Guest ProxyAgent Log Files (#2975)

* Remove debug info from waagent.status.json (#2971)

* Remove debug info from waagent.status.json

* pylint warnings

* pylint

---------

Co-authored-by: narrieta <narrieta>

* Extension sequencing scenario (#2969)

* update tests

* cleanup

* .

* .

* .

* .

* .

* .

* .

* .

* .

* Add new test cases

* Update scenario to support new tests

* Scenario should support failing extensions and extensions with no settings

* Clean up test

* Remove locations from test suite yml

* Fix deployment issue

* Support creating multiple resource groups for vmss in one run

* AzureMonitorLinuxAgent is not supported on flatcar

* AzureMonitor is not supported on flatcar

* remove agent update

* Address PR comments

* Fix issue with getting random ssh client

* Address PR Comments

* Address PR Comments

* Address PR comments

* Do not keep rg count in runbook

* Use try/finally with lock

* only check logs after scenario starts

* Change to instance member

---------

Co-authored-by: narrieta <narrieta>

* rename log file for agent publish scenario (#2956)

* rename log file

* add param

* address comment

* Fix name collisions on resource groups created by AgentTestSuite (#2981)

Co-authored-by: narrieta <narrieta>

* Save goal state history explicitly (#2977)

* Save goal state explicitly

* typo

* remove default value in internal method

---------

Co-authored-by: narrieta <narrieta>

* Handle errors when adding logs to the archive (#2982)

Co-authored-by: narrieta <narrieta>

* Timing issue while checking cpu quota (#2976)

* timing issue

* fix pylint

* undo

* Use case-insensitive match when cleaning up test resource groups (#2986)

Co-authored-by: narrieta <narrieta>

* Update supported Ubuntu versions (#2980)

* Fix pylint warning (#2988)

Co-authored-by: narrieta <narrieta>

* Add information about HTTP proxies (#2985)

* Add information about HTTP proxies

* no_proxy

---------

Co-authored-by: narrieta <narrieta>

* agent persist firewall scenario (#2983)

* agent persist firewall scenario

* address comments

* new comments

* GA versioning refactor plus fetch new rsm properties. (#2974)

* GA versioning refactor

* added comment

* added abstract decorator

* undo abstract change

* update names

* addressed comments

* pylint

* agent family

* state name

* address comments

* conf change

* Run remote date command to get test case start time (#2993)

* Run remote date command to get test case start time

* Remove unused import

* ext_sequencing scenario: get enable time from extension status files (#2992)

* Get enable time from extension status files

* Check for empty array

* add status example in comments

* ssh connection retry on restarts (#3001)

* Add e2e test scenario for hostname monitoring (#3003)

* Validate hostname is published

* Run on distro without known issues

* Add comment about debugging network down

* Create e2e scenario for hostname monitoring

* Remove unused import

* Increase timeout for hostname change

* Add password to VM and check for agent status if ssh fails

* run scenario on all endorsed distros

* Use getdistro() to check distro

* Add comment to get_distro

* Add publish_hostname to runbook

* Make get_distro.py executable

* Address first round of PR comments

* Do not enable hostname monitoring on distros where it is disabled

* Skip test on ubuntu

* Update get-waagent-conf-value to remove unused variable

* AMA is not supported on cbl-mariner 1.0 (#3002)

* Cbl-mariner 1.0 is not supported by AMA

* Use get distro to check distro

* Add comment to get_distro

* log update time for self updater (#3004)

* add update time log

* log new agent update time

* fix tests

* Fix publish hostname in china and gov clouds (#3005)

* Fix regex to parse china/gov domain names

* Improve regex

* Improve regex

* Self update e2e test (#3000)

* self-update test

* addressed comments

* fix tests

* log

* added comment

* merge conflicts

* Lisa should not cleanup failed environment if keep_environment=failed (#3006)

* Throw exception for test suite if a test failure occurs

* Remove unused import

* Clean up

* Add comment

* fix(ubuntu): Point to correct dhcp lease files (#2979)

Starting with Ubuntu 18.04, the default DHCP client is systemd-networkd.
However, WALA has been checking for the dhclient lease files.
This PR seeks to correct this bug. Interestingly, WALA was already
configuring systemd-networkd but checking for dhclient lease files.
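
As a rough illustration of the idea (not the PR's actual change), the lookup could prefer the systemd-networkd lease location and fall back to dhclient's. The helper name `pick_lease_dir` is hypothetical, and both directory paths are assumptions that should be verified on the target image:

```python
import os


def pick_lease_dir(candidates):
    """Return the first candidate directory that exists, or None."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None


# Assumed locations, listed in preference order; verify them on the target image.
LEASE_DIR_CANDIDATES = [
    "/run/systemd/netif/leases",  # systemd-networkd (assumed path)
    "/var/lib/dhcp",              # dhclient (assumed path)
]

if __name__ == "__main__":
    print(pick_lease_dir(LEASE_DIR_CANDIDATES))
```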

Co-authored-by: Norberto Arrieta <narrieta@users.noreply.github.com>

* Use self-hosted pool for automation runs (#3007)

Co-authored-by: narrieta <narrieta>

* Add distros which use Python 2.6 (for reference only) (#3009)

Co-authored-by: narrieta <narrieta>

* Move cleanup pipeline to self-hosted pool (#3010)

Co-authored-by: narrieta <narrieta>

* NM should not be restarted during hostname publish if NM_CONTROLLED=y (#3008)

* Only restart NM if NM_controlled=n

* Clean up code

* Clean up code

* improve logging

* Make check on NM_CONTROLLED value strict

* Install missing dependency (jq) on Azure Pipeline Agents (#3013)

* Install missing dependency (jq) on Azure Pipeline Agents

* use if statement

* remove if statement

---------

Co-authored-by: narrieta <narrieta>

* Do not reset the mode of an extension's log directory (#3014)

Co-authored-by: narrieta <narrieta>

* Daemon should remove stale published_hostname file and log useful warning (#3016)

* Daemon should remove published_hostname file and log useful warning

* Clean up fast track file if vm id has changed

* Clean up initial_goal_state file if vm id has changed

* Clean up rsm_update file if vm id has changed

* Do not report TestFailedException in test results (#3019)

Co-authored-by: narrieta <narrieta>

* skip agent update run on arm64 distros (#3018)

* Clean test VMs older than 12 hours (#3021)

Co-authored-by: narrieta <narrieta>

* honor rsm update with no time when agent receives new GS (#3015)

* honor rsm update immediately

* pylint

* improve msg

* address comments

* address comments

* address comments

* added verbose logging

* Don't check Agent log from the top after each test suite (#3022)

* Don't check Agent log from the top after each test suite

* fix initialization of override

---------

Co-authored-by: narrieta <narrieta>

* update the proxy agent log folder for logcollector (#3028)

* Log instance view before asserting (#3029)

* Add config parameter to wait for cloud-init (Extensions.WaitForCloudInit) (#3031)

* Add config parameter to wait for cloud-init (Extensions.WaitForCloudInit)

---------

Co-authored-by: narrieta <narrieta>

* Revert changes to publish_hostname in RedhatOSModernUtil (#3032)

* Revert changes to publish_hostname in RedhatOSModernUtil

* Fix pylint bad-super-call

* Remove agent_wait_for_cloud_init from automated runs (#3034)

Co-authored-by: narrieta <narrieta>

* Adding AutoUpdate.UpdateToLatestVersion new flag support (#3020)

* support new flag

* address comments

* added more info

* updated

* address comments

* resolving comment

* updated

* Retry get instance view if only name property is present (#3036)

* Retry get instance view if incomplete during assertions

* Retry getting instance view if only name property is present

* Fix regex in agent extension workflow (#3035)

* Recover primary nic if down after publishing hostname in RedhatOSUtil (#3024)

* Check nic state and recover if down:

* Fix typo

* Fix state comparison

* Fix pylint errors

* Fix string comparison

* Report publish hostname failure in calling thread

* Add todo to check nic state for all distros where we reset network

* Update detection to check connection state and separate recover from publish

* Pylint unused argument

* refactor recover_nic argument

* Network interface e2e test

* e2e test for recovering the network interface on redhat distros

* Only run scenario on distros which use RedhatOSUtil

* Fix call to parent publish_hostname to include recover_nic arg

* Update comments in default os util

* Remove comment

* Fix comment

* Do not do detection/recover on RedhatOSModernUtil

* Resolve PR comments

* Make script executable

* Revert pypy change

* Fix publish hostname parameters

* Add recover_network_interface scenario to runbook (#3037)

* Implementation of new conf flag AutoUpdate.UpdateToLatestVersion support (#3027)

* GA update to latest version flag

* address comments

* resolving comments

* added TODO

* ignore warning

* resolving comment

* address comments

* config present check

* added a comment

* Fix daily pipeline failures for recover_network_interface (#3039)

* Fix daily pipeline failures for recover_network_interface

* Clear any unused settings properties when enabling cse

---------

Co-authored-by: Norberto Arrieta <narrieta@users.noreply.github.com>

* Keep failed VMs by default on pipeline runs (#3040)

* enable RSM e2e tests (#3030)

* enable RSM tests

* merge conflicts

* Check for 'Access denied' errors when testing SSH connectivity (#3042)

Co-authored-by: narrieta <narrieta>

* Add Ubuntu 24 to end-to-end tests (#3041)

* Add Ubuntu 24 to end-to-end tests

* disable AzureMonitorLinuxAgent

---------

Co-authored-by: narrieta <narrieta>

* Skip capture of VM information on test runs (#3043)

Co-authored-by: narrieta <narrieta>

* Create symlink for waagent.com on Flatcar (#3045)

Co-authored-by: narrieta <narrieta>

* don't allow agent update if attempts reached max limit (#3033)

* set max update attempts

* download refactor

* pylint

* disable RSM updates (#3044)

* Skip test on alma and rocky until we investigate (#3047)

* fix agent update UT (#3051)

* version update to 2.10.0.8 (#3050)

* modify agent update flag (#3053)

---------

Co-authored-by: Norberto Arrieta <narrieta@users.noreply.github.com>
Co-authored-by: maddieford <93676569+maddieford@users.noreply.github.com>
Co-authored-by: Long Li <longli@microsoft.com>
Co-authored-by: sebastienb-stormshield <sebastien.bini@stormshield.eu>
Co-authored-by: Zheyu Shen <arsdragonfly@gmail.com>
Co-authored-by: Zhidong Peng <zpeng@microsoft.com>
Co-authored-by: d1r3ct0r <mwadimemakokha@gmail.com>
Co-authored-by: narrieta <narrieta>